Why visualization?
Data visualization is an essential tool for urban analytics. It offers an intuitive and quick way to understand the characteristics and relationships embedded in your data. Sometimes a good data visualization can deliver a message more effectively than any other medium and exercise transformative power that shapes people’s perception see this example. And not to mention that cool visualizations are cool.
The first thing we analysts do when we get our hands on a dataset is to understand it. The best way to understand any given data is to visualize it. We have a wide variety of maps and charts to choose from, including scatter plot, histogram, box plot, violin plot, and mapping. We have been doing mapping a lot, so this document will focus more on plots.
library(tidyverse)
library(sf)
library(tmap)
library(leaflet)
library(tidycensus)
# Let's prepare data
yelp <- read_rds("https://raw.githubusercontent.com/ujhwang/urban-analytics-2024/main/Lab/module_2/week2/yelp_in.rds")
# Census data
census_var <- c(hhincome = 'B19019_001',
pop = "B02001_001",
race.white = "B02001_002",
race.black = 'B02001_003'
)
census <- get_acs(geography = "tract", state = "GA", county = c("Fulton", "DeKalb"),
output = "wide", geometry = TRUE, year = 2020,
variables = census_var)
summarise_mean <- c(str_c(names(census_var), "E"),
"rating", "review_count")
census_yelp <- census %>%
separate(col = NAME, into=c("tract","county","state"), sep=', ') %>%
# Spatial join
st_join(yelp %>%
mutate(n = 1,
price = nchar(price)) %>%
st_transform(crs = st_crs(census))) %>%
# Group_by
group_by(GEOID, county) %>%
# Mean for all census variables, sum for n
summarise(across(
all_of(summarise_mean), mean),
n = sum(n),
price = median(price)) %>%
# Release grouping
ungroup() %>%
# Drop 'E' from column names
rename_with(function(x) str_sub(x,1,nchar(x)-1), str_c(names(census_var), "E")) %>% # rename_with() renames with a function
# Replace NA in column n&review_count with 0
mutate(across(c(n, review_count), function(x) case_when(is.na(x) ~ 0, TRUE ~ x)))
As usual, using tmap to visualize the data.
tmap_mode("view");
a <- tm_shape(census_yelp) +
tm_????(col = "review_count", style = "quantile")
b <- tm_shape(yelp) +
tm_????(col = "review_count", style="quantile")
tmap_arrange(a,b, sync = TRUE)
## tmap mode set to interactive viewing
Or we can use leaflet() package for mapping.
library(leaflet)
library(htmlwidgets)
library(htmltools)
# CSS for title
tag.map.title <- tags$style(HTML("
.leaflet-control.map_title {
position: absolute;
left: 50px;
width: 320px;
text-align: left;
color: white;
padding-left: 10px;
background: rgba(200,200,200,0.2);
font-weight: bold;
font-size: 20px;
font-family: Helvetica;
border-color: white;
border-radius: 10px;
}"))
# Format title
title <- tags$div(
class="map_title", tag.map.title, HTML("<p>Restaurants in Fulton and DeKalb Counties from Yelp</p>")
)
# Color palette
fill_pal <- colorQuantile(palette = "YlOrRd", domain=yelp$review_count)
# Label for mouseover & popup
yelp_labels <- paste(
"<a href=",yelp$url,">",yelp$name, "</a><br>",
"<strong>Review Count: </strong>", yelp$review_count,"<br>",
"<strong>Rating: </strong>", yelp$rating) %>%
lapply(htmltools::HTML)
# Creating a Leaflet widget
leaflet() %>%
# Setting the view on load
setView(lng = -84.3903996350635, lat = 33.77074368998939, zoom = 11) %>%
# Dark base map
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
# Polygon boundary
addPolygons(data = census %>% st_union(),
opacity=0.2,
fillOpacity=0,
weight=1,
color="white") %>%
# Yelp point
addCircleMarkers(data = yelp,
radius = yelp$rating*1.5,
opacity=0.2,
fillColor=~fill_pal(review_count),
weight=1,
color= ~fill_pal(review_count),
popup= ~yelp_labels,
label= ~yelp_labels) %>%
# Legend
addLegend("bottomright", pal = fill_pal, values = yelp$review_count,title = "Review Count",opacity = 1) %>%
# Title
addControl(title, position="topleft",className="map_title")
For static maps with a lot of customizeability - ggplot2 can
also create maps. See this
example for an inspiration.
library(viridis)
## Loading required package: viridisLite
ggplot(data = yelp) +
geom_point(mapping = aes(color = log(review_count+1),
x = st_coordinates(yelp)[,1],
y = st_coordinates(yelp)[,2]),
size=2, alpha=0.9) +
theme_void() +
scale_color_viridis(trans = "log",
name="Review Count (logged)",
guide = guide_legend(keyheight = unit(3, units = "mm"),
keywidth=unit(12, units = "mm"),
label.position = "bottom",
title.position = 'top', nrow=1)) +
labs(
title = "Restaurant Distribution in Fulton & DeKalb Counties, GA",
subtitle = "Review Count from Yelp API"
) +
theme(
text = element_text(color = "#22211d"),
plot.background = element_rect(fill = "#f5f5f2", color = NA),
panel.background = element_rect(fill = "#f5f5f2", color = NA),
legend.background = element_rect(fill = "#f5f5f2", color = NA),
plot.title = element_text(size= 12, color = "black",
margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")),
plot.subtitle = element_text(size= 10, color = "black",
margin = margin(b = -0.1, t = 0.43, l = 2, unit = "cm")),
plot.caption = element_text( size=7, hjust=0, color = "black",
margin = margin(b = 0.3, r=-99, unit = "cm") )
) +
coord_map()
Scatter plot
In tmap package, two functions always go hand-in-hand, namely tm_shape() and tm_polygons() (or tm_lines, tm_dots, etc.). The tm_shape() function declares the data object to be displayed. Then, the tm_polygons() function defines the geometry shape and other associated characteristics.
A similar structure is used in ggplot2 package. Creating a plot needs
at least two functions that are connected by +: ggplot() function and
geom_point() (or other geometry types, such as geom_line, geom_box plot,
etc.). In the example below, ggplot(data = census_yelp)
indicates that we are drawing ggplot using yelp data. Then,
geom_point(aes(x = review_count, y = rating)) shows that we
are going to draw a scatter plot using review_count and
rating columns.
ggplot(data = ????) +
geom_point(mapping = aes(x = ????, y = ????))
Aesthetic mappings
We can add additional information to this plot using a few different
strategies, including colors, sizes, and shapes inside the
aes() – ‘Aesthetic mappings’ which describe how variables
in the data are mapped to visual properties (aesthetics) of geoms. For
example, we can add price information using color.
ggplot(data = ????) +
geom_point(mapping = aes(x=????, y=????,
color=????)) #<<
Or size.
ggplot(data = ????) +
geom_point(mapping = aes(x=????, y=????,
size=????)) #<<
Or transparency.
fig1 <- ggplot(data = ????) +
geom_point(mapping = aes(x=????, y=????,
alpha=price)) #<<
fig2 <- ggplot(data = ????) +
geom_point(mapping = aes(x=????, y=????,
alpha=price, #<<
size=price)) #<<
gridExtra::grid.arrange(????, ????, ncol= 1)
If you put, for example, color and size arguments outside of
aes(), the visual property from those arguments are not
based on your data.
fig2 <- ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating),
color = "orange") #<<
fig3 <- ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating),
size = 5) #<<
gridExtra::grid.arrange(fig2, fig3, ncol = 1)
Separating plots by categorical values
Use facet_wrap with a categorical variable. Remember
that if you use facet_wrap with a continuous data, it will
generate as many plots as the unique values in the continuous data.
Avoid writing such codes!
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating)) +
facet_wrap(~county) #<<
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating)) +
facet_grid(???? ~ ????) #<<
# One will be rows; the other will be columns.
# try swapping the two
Regression line
You can draw a regression line by using geom_smooth with
the “lm” method.
ggplot(data = census_yelp) +
geom_????(mapping = aes(x=review_count, y=rating), method = ????)
## `geom_smooth()` using formula = 'y ~ x'
More informative if you show both data points and the regression line.
# More than one layers
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating)) +
geom_????(mapping = aes(x=review_count, y=rating), method = ????)
## `geom_smooth()` using formula = 'y ~ x'
In the code above, we are repeating
aes(x=review_count, y=rating) twice. If we know that
mapping will be the same in multiple layers, we can define it in
ggplot().
# You don't need to repeat `mapping = aes(x=review_count, y=rating)` twice
ggplot(data = census_yelp, mapping = aes(x=review_count, y=rating)) + #<<
geom_point() +
geom_????(method = ????)
## `geom_smooth()` using formula = 'y ~ x'
If you want to add a specific mapping to a layer, you provide additional mapping to individual layers.
ggplot(data = census_yelp, mapping = aes(x=review_count, y=rating)) +
geom_point() +
geom_????(mapping = aes(color = ????), #<<
method = ????)
## `geom_smooth()` using formula = 'y ~ x'
You can append labs() to specify labels.
ggplot(data = census_yelp,
???? = aes(x=review_count, y=rating)) +
geom_point() +
geom_????(???? = aes(color = ????), method = ????) +
labs(x = "Review Count in Yelp", #<<
y = "Rating in Yelp",
color = "County in Census",
title = "Do better rated restaurants have more reviews?")
## `geom_smooth()` using formula = 'y ~ x'
Aesthetic options
Theme
We can change the overall theme of the plot using
theme_<...>.
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating, color = county)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "County in Census",
title = "Do better rated restaurants have more reviews?") +
theme_bw()
Of course, dark is always cooler.
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating, color = county)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "County in Census",
title = "Do better rated restaurants have more reviews?") +
ggdark::dark_theme_gray()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
Custom color scheme
If you want to use your custom color choices - for a discrete variable.
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating, color = county)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "County in Census",
title = "Do better rated restaurants have more reviews?") +
scale_color_manual(values = c("green", "darkblue")) + #<<
theme_bw()
If you want to use your custom color choices - for a continuous variable.
ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating, color = hhincome)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "Annual Household Income",
title = "Do better rated restaurants have more reviews?") +
scale_color_gradient(low="darkblue", high="red") + #<<
theme_bw()
Data labeling
Let’s label some of the outliers using ggrepel.
outliers <- census_yelp %>%
arrange(desc(review_count)) %>%
slice(1:4)
ggplot(data = census_yelp,
aes(x=review_count, y=rating)) + # moved aes() to here
geom_point(mapping = aes(color = ????)) + # Colored ones
geom_point(data = outliers, size = 3, shape = 1, color = "black") + # Black circles
ggrepel::geom_label_repel(data = census_yelp, mapping = aes(label = ????)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "Annual Household Income",
title = "Do better rated restaurants have more reviews?") +
scale_color_gradient(low="darkblue", high="red") + #<<
theme_light()
Interactive visualization
How can we convert these plots into interactive ones? The simplest
way is to use ggplotly() in plotly
package.
p <- ggplot(data = census_yelp) +
geom_point(mapping = aes(x=review_count, y=rating, color = hhincome)) +
labs(x = "Review Count in Yelp",
y = "Rating in Yelp",
color = "Annual Household Income",
title = "Do better rated restaurants have more reviews?") +
scale_color_gradient(low="darkblue", high="red") + #<<
theme_bw()
plotly::????(p)
3D scatter plot
Using plot_ly in plotly package, we can
also create a 3-dimensional scatter plot. It is particularly useful when
you want to examine associations between four variables.
plotly::????(census_yelp %>% st_drop_geometry(),
x = ~hhincome,
y = ~n,
z = ~rating,
color = ~price,
type = "scatter3d",
mode = "markers",
text = ~paste0("Avg. price: ", round(price, 2)),
hoverinfo = "text")
## Warning: Ignoring 139 observations
Bar plot
Bar plot is also very frequently used. Note that ggplot creates
y-axis automatically by examining how many rows there are for each
category of x. You can try
yelp %>% group_by(price) %>% tally() to check the
exact Y value for this plot.
ggplot(data = yelp) +
geom_bar(mapping = aes(x=price))
yelp %>%
st_set_geometry(NULL) %>%
group_by(price) %>%
tally()
## # A tibble: 4 × 2
## price n
## <chr> <int>
## 1 $ 1653
## 2 $$ 1850
## 3 $$$ 141
## 4 $$$$ 23
We can also further break each price level by another categorical
variable. We use rating to see the relative frequency of each rating for
each price level. This is done by adding fill=rating in the
mapping.
ggplot(data = yelp %>%
st_set_geometry(NULL) %>%
mutate(rating = round(rating,0) %>% #<<
factor(ordered = TRUE))) + #<<
# delete %>% factor(ordered = T) and see what happens
geom_bar(mapping = aes(x=????, fill=rating), position = "stack")
By changing position="stack" to
position="fill", we convert the Y-axis to the proportion
within each level of price and fill it up to the top. This shows more
clearly how different rating levels are distributed within each price
level.
ggplot(data = yelp %>%
st_set_geometry(NULL) %>%
mutate(rating = round(rating,0) %>%
factor(ordered = TRUE))) + #<<
geom_bar(mapping = ????(x=????, fill=rating), position = ????) #<<
We want rating=5 to be on the top because it is more intuitive to see
higher value on top. We can flip the bar plot vertically by adjusting
the levels when we declare rating variable
into a factor.
ggplot(data = yelp %>%
st_set_geometry(NULL) %>%
mutate(rating = round(rating,0) %>%
factor(levels = c(5,4,3,2,1), #<<
ordered = TRUE))) +
geom_bar(mapping = ????(x=????, fill=rating), position = ????)
Sometimes we want to see the exact figures on top of the bar plot. So, let’s add the percentage within each level of price as labels.
ggplot(data = yelp %>%
st_set_geometry(NULL) %>%
mutate(rating = round(rating,0) %>% factor(levels = c(5,4,3,2,1), ordered = TRUE))) +
geom_bar(???? = aes(x=????, fill=rating), position = ????) +
geom_text(data = . %>%
# Grouping to calculate % by price and by rating
group_by(price, rating) %>% #<<
# Count rows
tally() %>% #<<
# Convert to p
mutate(p = n / sum(n)) %>% #<<
# Re-order to match the order in bar plot
arrange(desc(rating)), #<<
aes(x = price, y = p, label = str_c(round(p,3)*100,"%")), color = "white",
position = position_stack(vjust=0.5)) +
ggdark::dark_theme_gray() # Dark theme because texts are not visible against white bg
You can flip it 90-degrees.
ggplot(data = yelp %>%
st_set_geometry(NULL) %>%
mutate(rating = round(rating,0) %>% factor(levels = c(5,4,3,2,1), ordered = TRUE))) +
geom_bar(???? = aes(x=????, fill=rating), position = ????) +
geom_text(data = . %>%
# Grouping to calculate % by price and by rating
group_by(price, rating) %>%
# Count rows
tally() %>%
# Convert to p
mutate(p = n / sum(n)) %>%
# Re-order to match the order in bar plot
arrange(desc(rating)),
aes(x = price, y = p, label = str_c(round(p,3)*100,"%")), color = "white",
position = position_stack(vjust=0.5)) +
coord_flip() + #<<
ggdark::dark_theme_gray() # Dark theme because texts are not visible against white bg
Customization example (Optional)
I saw this beautiful example by CÉDRIC SCHERER and wanted to show you a Yelp version of the code.
# Code & ideas borrowed heavily from CÉDRIC SCHERER's personal website:
# https://www.cedricscherer.com/2021/07/05/a-quick-how-to-on-labelling-bar-graphs-in-ggplot2/
max_city_n <- 10
rest_by_city <- yelp %>%
st_set_geometry(NULL) %>%
group_by(location.city) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:max_city_n) %>%
mutate(location.city = factor(location.city, levels = .$location.city[seq(max_city_n,1)])) %>%
# Format text label
mutate(pct = scales::percent(n / sum(n),accuracy = 0.1),
pct = case_when(row_number() == 1 ~ str_c(pct, " of all businesses"), TRUE ~ pct)) %>%
# Define aesthetic properties - label location
mutate(nudge = case_when(row_number()==1 ~ 1.05, TRUE ~ -0.2)) %>%
# Define aesthetic properties - color
mutate(color = case_when(row_number()==1 ~ "gray30", TRUE ~ "gray70")) %>%
# Color palette
mutate(pal = c(rep('gray70', max_city_n-4), "coral2", "mediumpurple1", "mediumpurple1", "goldenrod1")) %>%
# with() is required to be able to call variables with referencing to data frame
with(
# ggplot
ggplot(data = .) +
# Bars
geom_col(mapping = aes(y = location.city, x = n, fill = location.city)) +
# Text
geom_text(mapping = aes(y = location.city, x = n, label = pct),
# Calling aesthetic properties defined above
hjust=nudge, color=color,
# Font styling
fontface="bold.italic") +
# Stretch x axis
scale_x_continuous(limits = c(NA, 2200)) +
# Custom palette
scale_fill_manual(values = pal, guide="none") +
# Labels
labs(x = "Count", y = "Cities", title = "Top 10 cities with most restaurants in Fulton & DeKalb Counties\n") +
# Dark theme
ggdark::dark_theme_classic()
)
rest_by_city
Histogram, box plot, & violin plot
Histogram displays the distribution of a variable. It first assigns observations (i.e., rows) of the given variable (i.e., a column) into bins (0~99, 100~199, 200~299, etc.) and count the number of observations that fall into each bin. So, the taller a bar is, more observations fell into that bin. For example, in the histogram below, the first bar that touches 0 is much taller than other bars because most of the observations had zero or near-zero reviews.
Let’s examine the distribution of review counts using histogram.
ggplot(census_yelp) +
geom_histogram(mapping = ????(x = ????))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(census_yelp) +
geom_histogram(mapping = ????(x = ????),
bins = 60) #<< increase the number of bins from 30 to 50. You get more bars.
ggplot(census_yelp) +
geom_histogram(mapping = ????(x = ????),
bins = 60,
color="black") #<< color of the outline
ggplot(data = census_yelp) +
geom_histogram(mapping = ????(x = ????, fill=county), #<<
bins = 60,
color="black") +
scale_x_continuous(breaks=seq(0,1300, by=100))
ggplot(data = census_yelp) +
geom_histogram(mapping = ????(x = ????, fill=county),
bins = 60,
color="black",
position = "identity", #<<
alpha = 0.2) #<<
scale_x_continuous(breaks=seq(0,1300, by=100))
## <ScaleContinuousPosition>
## Range:
## Limits: 0 -- 1
ggplot(data = census_yelp) +
geom_histogram(mapping = ????(x = ????, fill=county),
bins = 60,
color="black",
position = "dodge") + #<< #<<
scale_x_continuous(breaks=seq(0,1300, by=100))
ggplot(data = census_yelp) +
geom_histogram(mapping = ????(x = ????, fill=county),
bins = 60,
color="black",
position = "dodge") + #<< #<<
scale_x_continuous(breaks=seq(0,1300, by=100)) +
scale_fill_manual(values=c("#999999", "#E69F00"))
Although the histogram above allows us to effectively see the
distribution, comparing multiple distributions are
often easier with a box plot. You need to understand what each component
of a box plot means to read it properly. plotly::ggplotly()
provides a good interactive visualization about how to read the plot.
Note that upper fence = Q3 + (1.5*IQR), where IQR is interquartile range
(Q3-Q1). The lower fence is Q1 - (1.5*IQR).
bxplot <- ggplot(data = yelp) +
geom_boxplot(aes(x=price, y=review_count),
color="black",fill="white")
plotly::????(bxplot)
a <- ggplot(data = yelp) +
geom_boxplot(aes(x=price, y=review_count),
fill = "white", color = "black")
b <- ggplot(data = yelp) +
geom_boxplot(aes(x=review_count, y=price),
fill="white", color="black")
gridExtra::grid.arrange(a, b)
Of course, you can use scale_fill_manual() to use your
custom color. To do this, however, you need to have specified
fill inside aes(). If it is not specified at
all or is specified outside of aes(), custom color won’t
work.
yelp %>%
st_join(census %>% st_transform(crs = st_crs(yelp))) %>%
separate(col = NAME, into=c("tract","county","state"),sep=", ") %>%
drop_na(county) %>%
ggplot() +
geom_boxplot(aes(x = price, y=review_count, fill = price),
color = "black") +
facet_wrap(~county) +
scale_fill_brewer(palette = "Blues")
Violin plot is yet another plot for visualizing the distribution of variables. While a box plot allows you to see where the upper/lower fences are and where the median and quartiles are, a violin plot allows you to see the concentration of observations at a certain bins (or range if you will) of values.
vplot <- ggplot(data = yelp %>%
st_join(census %>% st_transform(crs = st_crs(yelp))) %>%
separate(col = NAME, into=c("tract","county","state"),sep=", ") %>%
drop_na(county)) +
geom_violin(aes(x = price, y=review_count, fill = price),
color = "black") +
facet_wrap(~county) +
scale_fill_brewer(palette = "Blues")
plotly::ggplotly(vplot)
So, do restaurants in wealthy neighborhoods get higher Yelp ratings?
As we’ve seen so far, data visualization is an indispensable tool for urban analysts. If we have some time left in the class, think about socioeconomic/demographic variables in American Community Survey that may associate with review count, rating, and/or price of Yelp data. Then, get the data, plot them, and draw insights from them.
Association between household income and Yelp ratings.
census_yelp %>%
mutate(review_count_cut = cut(review_count, breaks = quantile(review_count, prob = c(0,0.5,0.75,1)), include.lowest=TRUE)) %>%
ggplot(data = ., aes(x = ????, y = ????)) +
geom_point(mapping = aes(color = review_count_cut)) +
scale_color_manual(values = c("gray50", "orange", "red"), labels = c("0 - 50th", "50th- 75th", "75th - 100th")) +
labs(x = "Annual Household Income", y = "Yelp Rating", color = "Review Count (discrete)", title = "Household Income vs. Rating") +
ggdark::dark_theme_gray() +
# ------------------------------------------------------------------
# This line of code adds the correlation analysis result to the plot
ggpubr::stat_cor(method = "pearson", label.x = 160000, label.y = 1.5)
Association between White population(%) and Yelp ratings.
census_yelp %>%
mutate(review_count_cut = cut(review_count, breaks = quantile(review_count, prob = c(0,0.5,0.75,1)), include.lowest=TRUE)) %>%
mutate(pct_white = race.white / pop) %>%
ggplot(data = ., aes(x = ????, y = ????)) +
geom_point(mapping = aes(color = review_count_cut)) +
scale_color_manual(values = c("gray50", "orange", "red"), labels = c("0 - 50th", "50th- 75th", "75th - 100th")) +
labs(x = "Proportion of White Residents", y = "Yelp Rating", color = "Review Count (discrete)", title = "Proportion of White Residents vs. Rating") +
ggdark::dark_theme_gray() +
# ------------------------------------------------------------------
# This line of code adds the correlation analysis result to the plot
ggpubr::stat_cor(method = "pearson", label.x = 0.6, label.y = 1.5)